Prediction of IMDb Scores for Scooby Doo Movies & TV Episodes

Michael Dixon, Caroline Klein, Lukas Muzila

5/8/23

Introduction to Topic and Motivation

  • Scooby Doo!
  • An iconic franchise with many movies and TV shows
  • Iterations vary dramatically in ratings
  • We wanted to investigate:

How can release and content details of Scooby Doo media be used to predict their audience rating?

Introduction to Data

  • Sourced from a user on data.world
  • Cleaned the data, going from 603 to 576 cases
    • Largely because ratings weren’t included for the newest movies/episodes
  • Predictors:
    • Network
    • Run time
    • Format
    • Engagement (number of reviews on IMDb)
    • Release year
    • Number of suspects
    • Setting/terrain of episodes
    • Whether there were Scooby Snacks
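
The cleaning step (dropping cases with no rating) can be sketched as below. This is a minimal illustration rather than the authors' actual code: the column names `title` and `imdb`, the `"NULL"` marker for a missing rating, and the four sample rows are all assumptions made for the example.

```python
import csv
import io

# Hypothetical miniature of the raw file. The column names ("title", "imdb")
# and the "NULL" missing-value marker are assumptions, not the real schema.
RAW_CSV = """title,imdb
Episode A,8.1
Episode B,NULL
Movie C,5.3
Episode D,
"""


def drop_unrated(rows, rating_col="imdb", missing=("", "NULL")):
    """Keep only rows with a usable rating and convert that rating to float."""
    cleaned = []
    for row in rows:
        value = (row.get(rating_col) or "").strip()
        if value in missing:
            continue  # no rating, so the row cannot be used as a training case
        cleaned.append({**row, rating_col: float(value)})
    return cleaned


rows = list(csv.DictReader(io.StringIO(RAW_CSV)))
cleaned = drop_unrated(rows)
print(f"{len(rows)} cases before cleaning, {len(cleaned)} after")
```

Applied to the real file, the same kind of filter would account for part of the drop from 603 to 576 cases.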

Highlights from EDA: Summary Statistics

Distribution of IMDb Scores
Minimum Score   Mean Score   Maximum Score
4.2             7.276042     9.6
  • Range of 5.4 (4.2 to 9.6)
  • Want to explore which features might explain these differences
    • Started with simple visualizations
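
The summary statistics above are simple to reproduce. The sketch below uses made-up scores (only the minimum 4.2 and maximum 9.6 mirror the reported values) to show how the range of 5.4 follows from the extremes.

```python
from statistics import mean

# Made-up scores for illustration only; the two endpoints mirror the
# reported minimum (4.2) and maximum (9.6), the rest are placeholders.
scores = [4.2, 6.8, 7.5, 8.0, 9.6]


def summarize(values):
    """Return the minimum, mean, maximum, and range of a list of scores."""
    lo, hi = min(values), max(values)
    return {"min": lo, "mean": mean(values), "max": hi, "range": round(hi - lo, 1)}


summary = summarize(scores)
print(summary)
```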

Highlights from EDA: Visualizations

  • Negative relationship between runtime and IMDb rating
  • Movies generally have lower ratings than TV shows
  • Some networks produce higher-rated content than others:
    • Cartoon Network had the highest-rated TV shows, but its movies weren’t great
    • The CW had the lowest-rated TV shows

Inference and Modeling: Best Prediction Model

  • Tried OLS, Ridge, LASSO, and Random Forest prediction models
Model    Test MSE
Linear   0.40656
Ridge    0.39537
LASSO    0.38594
RF       0.25264
  • Random Forest had the best performance (lowest test MSE)
    • Optimal parameters: 10 max features, 400 trees/estimators
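
A comparison like the one above could be set up roughly as follows. This sketch assumes scikit-learn (the slides do not name the library), uses synthetic data in place of the Scooby Doo features, and picks placeholder `alpha` values for Ridge and LASSO. Only the Random Forest settings (400 estimators, 10 max features) come from the slide.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real feature matrix (assumption): 12 numeric
# features so that max_features=10 is a valid Random Forest setting.
X, y = make_regression(
    n_samples=300, n_features=12, n_informative=6, noise=10.0, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

models = {
    "Linear": LinearRegression(),
    "Ridge": Ridge(alpha=1.0),  # alpha is a placeholder, not a tuned value
    "Lasso": Lasso(alpha=0.1),  # alpha is a placeholder, not a tuned value
    "RF": RandomForestRegressor(n_estimators=400, max_features=10, random_state=0),
}

# Fit each model on the training split and score it on the held-out split.
test_mse = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    test_mse[name] = mean_squared_error(y_test, model.predict(X_test))

# Print models from lowest to highest test MSE.
for name, mse in sorted(test_mse.items(), key=lambda item: item[1]):
    print(f"{name:<6} {mse:.3f}")
```

Because the synthetic target is linear in its features, the ranking printed here need not match the table. The point is the workflow: one shared train/test split, one test MSE per model.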

Inference and Modeling: Variable Importance

  • Engagement is the most important variable according to the RF model
  • CW network, run time, year, and TV series format are also important
  • In the linear models, the CW network indicator is the most influential variable (largest coefficient in magnitude)
    • Coefficient = -1.90, -1.73, and -1.85 for OLS, Ridge, and LASSO, respectively
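
The two importance rankings above can disagree, and the sketch below shows one possible reason on synthetic data. It assumes scikit-learn and NumPy, and the feature names and data-generating process are hypothetical. A rare binary indicator gets a large linear coefficient, while a continuous feature that varies across every case dominates the Random Forest's impurity-based importances.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
feature_names = ["engagement", "run_time", "is_cw"]  # hypothetical names

# Synthetic data (assumption): two standardized numeric features plus a
# rare 0/1 network indicator that is 1 for roughly 10% of cases.
n = 400
X = rng.normal(size=(n, 3))
X[:, 2] = (rng.random(n) < 0.1).astype(float)
y = 7.0 + 1.0 * X[:, 0] - 0.2 * X[:, 1] - 1.9 * X[:, 2] + rng.normal(scale=0.1, size=n)

ols = LinearRegression().fit(X, y)
forest = RandomForestRegressor(n_estimators=400, random_state=0).fit(X, y)

# Linear view: rank features by absolute coefficient.
coef_rank = sorted(zip(feature_names, ols.coef_), key=lambda p: -abs(p[1]))
# Forest view: rank features by impurity-based importance.
rf_rank = sorted(zip(feature_names, forest.feature_importances_), key=lambda p: -p[1])

print("Largest |coefficient|:", coef_rank[0][0])
print("Top RF importance:    ", rf_rank[0][0])
```

Coefficient size also depends on each feature's scale, so comparing raw coefficients across unscaled predictors should be done with care.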

Final Points

Conclusions:

  • Had success using machine learning to predict ratings for a popular franchise
  • Random forest model performed the best (lowest test MSE)
  • Some of the most important factors were network, format, and engagement

Future Work:

  • Could experiment with additional modeling techniques
  • Collect data of our own or from other sources
    • Training data was limited (576 cases)